You should generally schedule about 4x as many threads as the hardware can run simultaneously: GPUs are very good at context switching between warps, so oversubscribing ensures they always have work to do while other threads wait on memory.
The L2 (and, where present, L3) cache is managed by the hardware, but shared memory, which is carved out of the same on-chip storage as L1, is managed manually by the programmer.
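A minimal sketch of user-managed shared memory: a per-block tile is declared with `__shared__`, filled cooperatively, then reduced. The kernel name, tile size, and assumption that the block size is exactly 256 are illustrative, not from the source.

```cuda
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];          // per-block, user-managed on-chip storage
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // wait until the whole tile is loaded

    // Tree reduction within the block, halving the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}
```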
CUDA offers many helpful utilities through its CCCL (CUDA C++ Core Libraries): Thrust, CUB, and libcudacxx.
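As one example of what CCCL provides, Thrust gives STL-style containers and algorithms that run on the device; the sizes and values below are arbitrary.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<int> v(1000, 1);         // 1000 ones, stored on the GPU
    int sum = thrust::reduce(v.begin(), v.end());  // parallel reduction on the device
    return sum == 1000 ? 0 : 1;
}
```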
Each host has one or more CUDA devices (GPU), each CUDA device consists of multiple Compute Units, each Compute Unit consists of multiple Processing Elements.
We have blocks and grids: a grid consists of multiple blocks. Threads inside a block are co-scheduled on the same compute unit and can run at the same time, but blocks within a grid may run at different times. Consequently you can sync threads within a block but not across blocks.
The maximum number of threads in a block is 1024. Threads are further grouped into warps of size 32; a warp executes in lockstep, so threads in a warp should follow the same code path to avoid warp divergence.
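The standard indexing pattern that falls out of this hierarchy: each thread combines its block and thread coordinates into a global index. The kernel name is illustrative; the bounds check is the usual way to handle sizes that aren't a multiple of the block size, and whole warps tend to take the same branch, keeping divergence low.

```cuda
__global__ void scale(float *data, int n) {
    // Global index: which block we are in, times block width, plus our
    // position within the block. Warps are consecutive groups of 32 threads.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard against out-of-range threads in the last block
        data[i] *= 2.0f;
}
```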
Functions are host functions by default, meaning they run on the host (CPU). Adding the __device__ qualifier means the function runs on the device and is callable from device code; __global__ marks a kernel, which runs on the device but is launched from the host.
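A small sketch of the qualifiers side by side (function names are made up for illustration):

```cuda
__host__ __device__ float square(float x) { return x * x; }  // compiled for both CPU and GPU

__device__ float plusOne(float x) { return x + 1.0f; }       // device-only helper

__global__ void fill(float *out) {        // __global__: a kernel, launched from the host
    int i = threadIdx.x;
    out[i] = plusOne(square((float)i));   // device code may call __device__ functions
}
```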
You invoke kernels from the host using the triple-chevron syntax <<<blocks, threads, sharedMemBytes, stream>>> (the last two arguments are optional). E.g. vadd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
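Sizing the launch usually looks like this; it assumes a `vadd` kernel and device pointers `d_a`, `d_b`, `d_c` that have already been allocated, and the block size of 256 is just a common choice.

```cuda
int n = 1 << 20;
int threadsPerBlock = 256;
// Round up so every element gets a thread even when n is not a multiple
// of the block size; the kernel's bounds check handles the overshoot.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vadd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);  // shared mem and stream default to 0
```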
Kernel launches are asynchronous, so once you’ve launched your kernel you should call cudaDeviceSynchronize() to wait for it to finish.
CUDA has cudaMemcpy, which copies data between host and device (or device to device); the direction is given by a cudaMemcpyKind argument such as cudaMemcpyHostToDevice.
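The typical round trip looks like this; buffer names and `n` are placeholders.

```cuda
float *h_a = (float *)malloc(n * sizeof(float));   // host buffer
float *d_a;
cudaMalloc(&d_a, n * sizeof(float));               // device buffer

// The last argument names the direction of the copy.
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
// ... launch kernels that read/write d_a ...
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_a);
free(h_a);
```

Note that cudaMemcpy in this form is synchronous, so it also waits for previously launched kernels on the default stream to finish.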
It has a handful of functions for error handling, including cudaGetLastError() and cudaGetErrorString(); most API calls, including cudaDeviceSynchronize(), also return a cudaError_t you should check.
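A common pattern (not from the source, but widely used) is to wrap every runtime call in a checking macro; the macro name here is a convention, not a CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with file/line context on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   kernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // surfaces errors from the kernel itself
```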